From the June Mastering CorelDRAW newsletter
OCR Comes to CorelDRAW 4
Rich Zaleski
With the arrival of version 4, CorelDRAW has made a serious move into the
page layout arena. With 4's enhancements to Draw's bulk text handling and
formatting capabilities, it's only natural that the program's link to scanned-in
images should evolve from a tool that just converts bitmaps to editable vector
files, to one that can also turn scanned bitmap 'pictures of text' into editable text.
This is done via the new incorporation of Optical Character Recognition (OCR)
functionality into the CorelTRACE program. Trace's implementation of OCR may
not be on the level of dedicated OCR programs, but it is functional and has some
useful features that you might not expect to find in an add-on to what many
perceive as merely an add-on utility itself. In fact, it does a remarkable job of
handling this complex task, especially when you consider that many top-of-the-
line, standalone OCR packages sell for more than the entire Draw 4 suite of
applications.
Users with heavy OCR requirements will still find it advantageous to invest in a
more robust, dedicated OCR application. But those with occasional or limited
need to convert a scanned page of text or incoming fax into editable text, for use
in either Draw, a word processor or simply to save as a simple ASCII text file,
should find Trace's OCR capabilities adequate for their needs.
The OCR Advantage
Uncompressed, a full-page, 1-bit (black-and-white) bitmap in Windows .BMP file
format will occupy the better part of 500 KB of precious hard disk space. That
same file can be stored in compressed TIFF format, which will cut the file size
down to just over 100 KB, if the page isn't too tightly packed with text. However, if
like so many users today you're using disk compression software, much of the
advantage usually gained in compressing graphics files is lost, because Stacker,
DoubleSpace or whatever compression scheme is being used can't squeeze the
file much tighter -- it occupies nearly the original amount of real hard disk space.
Compare such bulky file sizes to the 2 or 3 KB that the same page of text will
occupy when converted to ASCII text format, and the advantage of 'OCR-ing'
any faxes or scanned-in text files that you need to keep on hand is soon evident.
And, of course, they become editable at the same time.
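The arithmetic behind these figures can be sketched in a few lines of Python.
The article does not state a page size or resolution, so the US Letter page
and the fax-like 200 dpi used here are assumptions, chosen because they land
close to the 500 KB figure. The function name is hypothetical, and the
calculation ignores file headers and row padding.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Size of the uncompressed pixel data in bytes.

    Ignores file headers and per-row padding, so a real .BMP file
    will be slightly larger than this estimate.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: US Letter page (8.5 x 11 inches) scanned at 200 dpi.
page = bitmap_bytes(8.5, 11, 200)

# The article's upper estimate for the same page stored as ASCII text.
text = 3 * 1024

print(f"bitmap: {page / 1024:.0f} KB")   # bitmap: 457 KB
print(f"ratio:  {page / text:.0f}x")     # ratio:  152x
```

At 300 dpi the same page comes to just over 1 MB of pixel data, so the gap
between bitmap and text only widens at laser-quality resolutions.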
Happily, if you use a scanner, the huge bitmap created when scanning pages of
text need never be stored on your hard disk. Simply make use of Trace's
TWAIN interface to scan in the image directly, by choosing Acquire Image from
the File menu, then clicking on Acquire. Use Object Linking and Embedding to
'OLE it' into PhotoPAINT for cleaning up or deskewing, if necessary, by choosing
Edit Image from the Edit menu. Then in Trace select the area of the page that
you want to convert to text by clicking and dragging a marquee, then click on the
OCR icon.
Memory Considerations
You should keep in mind that OCR is a memory-intensive task. For example, a
full page of text requires over 10 megabytes of memory to process. Even if
you've got plenty of available RAM, you may find it necessary to either maintain
a very large permanent swap file, avoid using Trace's OCR function while other
tasks run in the background, or both. I've choked Trace with a full page of small
type, on a 16 MB system using a 4 MB swap file. In this case, shutting down
other applications allowed the job to proceed to completion. If you're relying on a
swap file to provide the needed memory, you have to be willing to accept the
performance degradation that comes with virtual memory usage. (Adjust the size
of your swap file by double-clicking on the 386 Enhanced icon in the Windows
Control Panel.)
If you can't afford to have other memory-intensive apps shut down while you
perform an OCR operation during the day, one solution is to set up all the
bitmaps on which you need to perform the recognition as a batch trace. Then
start the batch process just before leaving the office for the day, when no other
apps will be running. In any case, you should click on Modify in the Settings
menu, then click on Batch Output, since it's here that you set the default output
directory and the file overwrite/make read only options for all of Trace's output.
Trace provides some controls to work with scanned text files of varying quality.
Choose OCR Method by clicking on Modify in the Settings menu. The default is
designed for 300 dpi bitmaps scanned from hard copy of at least laser printer
quality. Settings for dot matrix and fine-quality faxes (200 by 100 dpi) can also be
selected. These settings are sticky, and will remain active until you change them
or select Default from the main Settings menu. How much of a difference do
these settings make? Traced in Normal rather than Fax mode, a one-page test
file generated via fax produced a text file with 42 errors. With the OCR
method set to Fax, the same file converted with only a single error.
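If you want to run a similar comparison on your own scans, the error count
can be approximated by comparing the OCR output against a known-good copy of
the text. The article does not say how its errors were tallied, so the
word-level count below is only one reasonable convention, and the function
name is hypothetical. It uses Python's standard difflib module.

```python
from difflib import SequenceMatcher


def count_word_errors(reference, ocr_output):
    """Count words that were changed, dropped, or added by the OCR pass."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Charge the longer side of each mismatched run.
            errors += max(i2 - i1, j2 - j1)
    return errors


original = "The spell checker needs some improvement"
print(count_word_errors(original, original))                                    # 0
print(count_word_errors(original, "The spell chec~er needs some improvement"))  # 1
print(count_word_errors(original, "The spell needs some improvement"))          # 1
```

Run once against the output of each OCR Method setting, this gives a quick
way to see which setting suits a given source.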
A Few Rough Spots
You'll also notice an option for Check Spelling in this dialog box. In my tests, I
found this option to be virtually useless. When Draw, or your word processor,
checks spelling and comes across a combination of letters that it doesn't
recognize, it offers you the choice of accepting or correcting the spelling error.
Trace, however, simply ignores the word and doesn't trace it. I'd rather have the
output file say "The spell chec~er needs some improvement," than leave the
word out entirely and give me "The spell needs some improvement." At least in
the former case the spell checker in my word processor will have something to
catch.
This situation is aggravated by the fact that (as far as I've been able to tell)
Trace's use of the spell checker does not incorporate any user dictionary that
you might have created. Proper names and specialized terms simply get
dropped, rather than being flagged by having the rejected letters converted and
marked with a '~' or some other uncommon character. All in all, I'd strongly
recommend that you give Trace's Check Spelling option a miss.
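The difference between dropping and flagging is easy to see in a small
sketch. This is not Trace's actual code, which the article does not describe.
The function name, the tiny dictionary and the misread word are made up for
illustration, and for simplicity the marker is placed in front of the whole
word rather than in place of the doubtful letter.

```python
def filter_words(words, dictionary, flag_unknown=False, marker="~"):
    """Return the words as a sentence, handling unknown words two ways.

    flag_unknown=False drops unknown words (the behavior described above);
    flag_unknown=True keeps them, prefixed with a marker character.
    """
    result = []
    for word in words:
        bare = word.strip(".,;:!?\"'").lower()
        if bare in dictionary:
            result.append(word)
        elif flag_unknown:
            result.append(marker + word)
        # Otherwise the unknown word is silently dropped.
    return " ".join(result)


dictionary = {"the", "spell", "checker", "needs", "some", "improvement"}
ocr_words = "The spell checlcer needs some improvement".split()

print(filter_words(ocr_words, dictionary))
# The spell needs some improvement
print(filter_words(ocr_words, dictionary, flag_unknown=True))
# The spell ~checlcer needs some improvement
```

Only the flagged version leaves anything behind for a word processor's spell
checker, or a human proofreader, to find.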
Another area where the OCR function could stand some improvement is text
formatting. In short, it doesn't do any. It's not bad with straight paragraphs of
text, but with columnar data or anything out of the ordinary it just treats each
string of text as a line followed by a return and linespace. In the end, despite the
unexpected accuracy of the character recognition, you're still likely to face some
meaningful editing and reformatting time. Perhaps by the time 5.0 rolls around,
we'll at least see Rich Text Format (RTF) output with some semblance of
maintaining the format of the original image. As long as we're wishing, limited
font identification might be within reach as well.
The Forms Approach
Having stumbled across the weakest feature of Trace's OCR function, it's time to
look at what may be its strongest capability, and is certainly its most intriguing. In
addition to the standard OCR operation of converting to an ASCII text file, you
have the option of using the Forms tracing method. This routine first examines
the bitmap and traces any non-text elements as a graphic in outline and/or
centerline method, as appropriate. It then OCRs the text, but rather than saving
it as ASCII, it inserts it into the usual .EPS output file created by Trace as strings
of Artistic text laid out in the positions appropriate to the image that was traced,
but in the default font. It seems to want to use a sans serif font by default, since
depending on which fonts are in the Ares FontMinder Font Packs I have loaded,
it will be either 12.5-point Avant Garde or Arial. While it's not as fast as straight
OCR tracing, this feature is particularly handy when tracing logos with
accompanying text, letterheads, maps and technical illustrations.
In the accompanying illustrations, I faxed myself a blank invoice and used
Trace's Forms method to convert it to .EPS. The first trace took just over three
minutes on my 33 MHz 486 with 16 MB of memory, and it never required disk-
based virtual memory. Since Trace does not treat white text on a black
background as text, I then saved the .EPS file, cleared the .EPS window (press
Delete), inverted the image (choose Modify, then Image Filtering from the
Settings menu), and marquee selected the areas containing that text. After
running the Forms trace on these 'leftover' text strings, I saved the second .EPS
file under a different name.
I then imported both .EPS files into Draw and placed them side by side. After
ungrouping the .EPS file created with the second scan, I changed the fonts as
necessary and applied a white fill to them before turning my attention to the
other copy. I deleted the curves that represented the white text, then used the
Node Edit roll-up's Auto-Reduce function on the larger and more complex curves
that made up the form. I changed all the curve segments in the 'table' part of the form
to lines, and performed minor cleaning up and aligning by snapping the corners
to the grid.
Finally, I dragged the white text that remained from the second trace on top of
the form. Total time from loading the .PCX scan of the form into Trace to printing
out virtual duplicates of the original from Draw was just over half an hour. Could I
have drawn and lettered the form from scratch more quickly in Draw? I doubt it.
Is it for You?
If you have heavy-duty text conversion needs, you might not ever use Trace's
OCR capabilities, except for perhaps the occasional need to generate a text-
inclusive .EPS trace of a mixed text and graphic bitmap. But then again, if your
OCR needs are that intensive, you didn't buy CorelDRAW to fill them. That's why
Caere and Calera are in business. But for most graphics professionals, who
don't deal in lengthy text documents, Trace's OCR capabilities should fill the bill
reasonably well.
Those of you interested in trying out Trace's OCR capabilities for yourselves can
use the INV001.PCX file that was placed in the INVOICE directory of this
month's disk when you installed it. This is the scan of the form I discussed in the
article.
TIP
You can also continue an OCR session that halted due to insufficient memory by
closing the warning dialog box, selecting a smaller area to process, and then
doing the page in two passes.
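In Trace you make these selections by hand with the marquee, but the idea of
dividing a page into smaller passes can be sketched as follows. The function
name and the pixel dimensions are assumptions for illustration (a Letter page
at 200 dpi), not anything Trace exposes.

```python
def split_into_passes(left, top, right, bottom, passes=2):
    """Divide a rectangle into horizontal bands, top to bottom.

    Each band is returned as (left, top, right, bottom) in pixels.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    height = bottom - top
    bands = []
    for i in range(passes):
        band_top = top + height * i // passes
        band_bottom = top + height * (i + 1) // passes
        bands.append((left, band_top, right, band_bottom))
    return bands


# Assumed page: 1700 x 2200 pixels (8.5 x 11 inches at 200 dpi).
for band in split_into_passes(0, 0, 1700, 2200):
    print(band)
# (0, 0, 1700, 1100)
# (0, 1100, 1700, 2200)
```

A purely arithmetic cut like this can slice through a line of type, so in
practice you would nudge each boundary to fall in the white space between
lines or paragraphs.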
Contents Copyright Kazak Communications 1993
Subscription Information
While the regular subscription rate is $75 per year (in US dollars for Americans,
Canadian dollars for Canadians), charter subscriptions to the Mastering
CorelDRAW newsletter are available for a limited time at $60 (add $30 U.S. for
overseas). A free sample disk, from our exclusive disk-of-the-month service
(value $20), is included with your paid subscription.
To subscribe, or for more information, contact:
Chris Dickman
16 Ottawa St.
Toronto, ON M4T 2B6
Canada
416-924-0759 (voice)
416-924-4875 (fax)
CServe: 70730,2265
- 30 -